A Tight Lower Bound Instance for k-means++ in Constant Dimension
Authors
Abstract
The k-means++ seeding algorithm is one of the most popular algorithms used for finding the initial k centers for the k-means heuristic. The algorithm is a simple sampling procedure and can be described as follows: pick the first center uniformly at random from the given points. For i > 1, pick a point to be the i-th center with probability proportional to the square of the Euclidean distance of this point to the closest of the (i − 1) previously chosen centers. The k-means++ seeding algorithm is not only simple and fast but also gives an O(log k) approximation in expectation, as shown by Arthur and Vassilvitskii [7]. There are datasets [7, 3] on which this seeding algorithm gives an approximation factor of Ω(log k) in expectation. However, it is not clear from these results whether the algorithm achieves a good approximation factor with reasonably high probability (say, at least 1/poly(k)). Brunsch and Röglin [9] gave a dataset on which the k-means++ seeding algorithm achieves an O(log k) approximation ratio with probability that is exponentially small in k. However, this and all other known lower-bound examples [7, 3] are high-dimensional. An open problem, therefore, was to understand the behavior of the algorithm on low-dimensional datasets. In this work, we give a simple two-dimensional dataset on which the seeding algorithm achieves an O(log k) approximation ratio with probability exponentially small in k. This solves open problems posed by Mahajan et al. [13] and by Brunsch and Röglin [9].
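To make the D² sampling step concrete, here is a minimal Python sketch of the seeding procedure on two-dimensional points; the function name and data layout are ours, for illustration only, and this is not the authors' implementation. Each new center is drawn with probability D²(x)/∑_y D²(y), where D(x) is the Euclidean distance from x to the nearest center chosen so far.

    import random

    def kmeans_pp_seeding(points, k):
        """k-means++ seeding by D^2 sampling (illustrative sketch).

        points: list of (x, y) tuples; k: number of centers to pick.
        """
        # First center: chosen uniformly at random from the input points.
        centers = [random.choice(points)]
        for _ in range(1, k):
            # Squared Euclidean distance of each point to its closest chosen center.
            d2 = [min((px - cx) ** 2 + (py - cy) ** 2 for (cx, cy) in centers)
                  for (px, py) in points]
            # Next center: sampled with probability proportional to D^2.
            centers.append(random.choices(points, weights=d2, k=1)[0])
        return centers

Note that random.choices performs the weighted draw, and an already-chosen point has squared distance zero to itself, hence weight zero, so it is never picked again.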
Similar papers
A bound for Feichtinger conjecture
In this paper, using the discrete Fourier transform in the finite-dimensional Hilbert space C^n, a class of non-Rieszable equal-norm tight frames is introduced, and using this class, a bound for the Feichtinger conjecture is presented. By the Feichtinger conjecture, which has been proved recently, for given A, C > 0 there exists a universal constant δ > 0 independent of n such that every C-equal...
Sample Complexity of Testing the Manifold Hypothesis
The hypothesis that high-dimensional data tends to lie in the vicinity of a low-dimensional manifold is the basis of a collection of methodologies termed Manifold Learning. In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimizatio...
The mixing time for simple exclusion
We obtain a tight bound of O(L² log k) for the mixing time of the exclusion process in Z^d/LZ^d with k ≤ ½L^d particles. Previously the best bound, based on the log-Sobolev constant determined by Yau, was not tight for small k. When dependence on the dimension d is considered, our bounds are an improvement for all k. We also get bounds for the relaxation time that are lower-order in d than previous...
A Bad Instance for k-Means++
k-means++ is a seeding technique for the k-means method with an expected approximation ratio of O(log k), where k denotes the number of clusters. Examples are known on which the expected approximation ratio of k-means++ is Ω(log k), showing that the upper bound is asymptotically tight. However, it remained open whether k-means++ yields an O(1)-approximation with probability 1/poly(k) or even wi...
A Tight Lower Bound for High Frequency Moment Estimation with Small Error
We show an Ω((n^{1−2/p} log M)/ε²) bits of space lower bound for (1 + ε)-approximating the p-th frequency moment F_p = ‖x‖_p^p = ∑_{i=1}^n |x_i|^p of a vector x ∈ {−M, −M+1, . . . , M}^n with constant probability in the turnstile model for data streams, for any p > 2 and ε ≥ 1/n^{1/p} (we require ε ≥ 1/n^{1/p} since there is a trivial O(n log M) upper bound). This lower bound matches the space complexity of an upper bound of Gangu...